The METER Corpus: A corpus for analysing journalistic text reuse

Authors

  • Robert Gaizauskas
  • Jonathan Foster
  • Yorick Wilks
  • John Arundel
  • Paul Clough
  • Scott Piao
Abstract

As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. We envisage that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. To illustrate these potential uses we briefly describe some work we have done with the corpus to develop algorithms for detecting text reuse.
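
The abstract mentions algorithms for detecting text reuse without specifying them here. As a rough illustration only, the sketch below uses word n-gram containment, one common way to estimate how much of a candidate text also appears in a source text; it is not claimed to be the authors' method. The function names, the whitespace tokenisation, and the two thresholds are assumptions chosen for the example, and the three labels simply mirror the coarse classifications named in the abstract.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate: str, source: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also occur in the source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    src = word_ngrams(source, n)
    return len(cand & src) / len(cand)


def classify(score: float, whole: float = 0.6, partial: float = 0.2) -> str:
    """Map a containment score to a coarse label.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if score >= whole:
        return "wholly derived"
    if score >= partial:
        return "partially derived"
    return "non-derived"


if __name__ == "__main__":
    pa_text = "the cat sat on a mat"
    newspaper_text = "the cat sat on the mat"
    score = containment(newspaper_text, pa_text, n=3)
    print(score, classify(score))
```

Containment is asymmetric by design: it measures how much of the newspaper text is covered by the PA text, not the reverse, which suits a setting where one text is the suspected source of the other.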

Similar resources

Building and annotating a corpus for the study of journalistic text reuse

In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stories are rewritten from the PA source; in other ...

Using the XARA XML-Aware Corpus Query Tool to Investigate the METER Corpus

The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written i...

Transitions thématiques : Annotation d'un corpus journalistique et premières analyses (Manual thematic annotation of a journalistic corpus : first observations and evaluation) [in French]

Manual thematic annotation of a journalistic corpus: first observations and evaluation. The work presented in this paper focuses on the creation of a corpus of journalistic texts annotated at the discourse level, more precisely at the topic level. The annotation model is a classic segmentation one, to which we add transition zones between topical units. We assume that in a well-structured text, the a...

Postgraduate Transfer Report.PDF

This thesis builds upon our current understanding of text reuse by proposing a hypothetical framework of text reuse and applying this abstract definition to a specific domain, that of journalistic reuse. The framework aims to explore a suitable measure of reuse and determine suitable discriminators for document derivation. Although text can be reused verbatim (word-for-word), in most cases, tex...

A'laam Corpus: A standard named-entity corpus for the Persian language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

Publication date: 2001